GroqAI Blog

Groq: The King of Performance for AI Inference!

01 Groq beats the Nvidia GPU on raw strength

AI chip startup Groq recently launched the LPU, billed as the fastest large-model inference chip: a dedicated ASIC for large-model inference built by a team drawn from the original Google TPU group. An inference API running on the Groq chip is now open to the public.

Developers can apply for a free trial on the company’s official website (https://wow.groq.com/), or try Groq for free on Poe: https://poe.com/Mixtral-8x7b-Groq.

Two large models are currently available, Llama-70B-4K and Mixtral-8x7B-32K, and the inference API is fully compatible with OpenAI’s API.
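To make that compatibility concrete, here is a minimal sketch of calling the service through the standard OpenAI Python client. The base URL and model id below are assumptions inferred from the compatibility claim, not values stated in this post, so check the Groq console for the current ones.

```python
# Minimal sketch: Groq's OpenAI-compatible endpoint via the openai client.
# base_url and model id are assumptions; verify them in the Groq console.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",                  # issued after signing up at wow.groq.com
    base_url="https://api.groq.com/openai/v1",    # assumed OpenAI-compatible endpoint
)

resp = client.chat.completions.create(
    model="mixtral-8x7b-32768",                   # assumed model id
    messages=[{"role": "user", "content": "Explain what an LPU is in one sentence."}],
)
print(resp.choices[0].message.content)
```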

What amazes users is how explosive inference on the Groq LPU feels. In a question-and-answer scenario with the Llama 70B model, the delay between asking a question and receiving the answer is barely perceptible: the first word arrives in about 0.2 seconds, and more than 500 words are generated in roughly one second. ChatGPT needs nearly 10 seconds to produce the same amount of text, and its first word alone takes on the order of seconds.
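These figures are easy to sanity-check yourself. The sketch below streams a response and records time-to-first-output and an approximate words-per-second rate, reusing the client from the previous sketch; the model id is again an assumption.

```python
# Rough measurement of time-to-first-output and generation rate via streaming.
# Reuses `client` from the sketch above; the model id is an assumption.
import time

start = time.perf_counter()
first_output_at = None
words = 0

stream = client.chat.completions.create(
    model="llama2-70b-4096",                      # assumed model id
    messages=[{"role": "user", "content": "Write a 500-word story."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta and first_output_at is None:
        first_output_at = time.perf_counter()
    words += len(delta.split())                   # crude word count, not tokens

elapsed = time.perf_counter() - start
print(f"first output after {first_output_at - start:.2f}s, "
      f"~{words / elapsed:.0f} words/s over {elapsed:.2f}s")
```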

On throughput specifically, Groq has published a set of comparison data, shown in the figure below: measured against the strongest players in the industry, the Groq LPU is roughly 10 times faster, far outclassing inference products built on Nvidia GPUs.

The Groq LPU is designed primarily for large-model inference and performs best in scenarios with many concurrent users. The excellent inference speed also translates into lower inference cost: the prices Groq has published are already the lowest in the industry, with the Mixtral-8x7B MoE model roughly 4 times cheaper than GPT-3.5.

According to these reports, the Groq LPU is 10 times faster than an Nvidia GPU at one-tenth the cost, a 100-fold improvement in price-performance. In short, it is “fast and cheap”, satisfying users on both speed and cost.

Cost is the big open question. The official inference price is very low, yet each card reportedly sells for 20,000 US dollars, so how can Groq sustain an industry-leading inference price?

Take Llama-70B as an example and speculate on the cost structure. Running this model requires roughly 256 Groq LPU cards, equivalent to 4 server racks with 64 cards each; by comparison, a single eight-card H100 GPU server can also serve the model effectively.

In terms of hardware cost alone, then, a Groq LPU deployment is far more expensive than the Nvidia setup.

A reasonable explanation has two parts. On one hand, the current list price of the LPU card is high and will likely drop significantly, bringing the total cost of ownership of an on-premise deployment in line with an Nvidia server. On the other hand, this configuration suits scenarios with many concurrent users: with high throughput and heavy concurrency, the cost of any single user’s single request falls sharply.
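As a rough sanity check on the amortization argument, the sketch below spreads hardware cost over a deployment’s lifetime and divides by aggregate throughput. The 256 × $20,000 figure comes from the text; the H100 server price, amortization period, and throughput numbers are illustrative placeholders, not vendor figures.

```python
# Rough per-token hardware cost model for the amortization argument above.
# All non-quoted prices and throughputs are illustrative placeholders.

def hardware_cost_per_million_tokens(hardware_usd: float,
                                     amortization_years: float,
                                     aggregate_tokens_per_s: float) -> float:
    """Hardware-only $/1M tokens for a deployment serving many users at once."""
    seconds = amortization_years * 365 * 24 * 3600
    usd_per_second = hardware_usd / seconds
    return usd_per_second / aggregate_tokens_per_s * 1e6

# 256 cards x $20,000 is from the text; the H100 server price and both
# throughput values are made-up placeholders to show how the formula behaves.
lpu_cluster = hardware_cost_per_million_tokens(256 * 20_000, 3, aggregate_tokens_per_s=20_000)
h100_server = hardware_cost_per_million_tokens(300_000, 3, aggregate_tokens_per_s=3_000)
print(f"LPU cluster: ${lpu_cluster:.2f}/1M tokens, 8x H100 server: ${h100_server:.2f}/1M tokens")
# Whichever setup serves more aggregate tokens per hardware dollar wins;
# that is why heavy concurrency is central to the LPU's economics.
```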

This ability to handle highly concurrent inference requests comes from the LPU’s distinctive underlying architecture: memory units are interleaved with the vector and matrix function units, exploiting the inherent parallelism of machine-learning workloads to accelerate inference.

In addition to computation, each TSP also acts as a network switch, exchanging data directly with other TSPs over the chip-to-chip network without relying on external networking equipment. This design improves the system’s parallelism and efficiency.

Unlike Nvidia GPUs, which depend on high-speed transfers from external memory, the Groq LPU does not use high-bandwidth memory (HBM) at all; it uses SRAM, which is roughly 20 times faster than the memory GPUs rely on. This design choice also raises throughput significantly.

On the software side, the Groq LPU supports inference from standard machine-learning frameworks such as PyTorch and TensorFlow. Groq also provides a compilation platform and an on-premise deployment option, so users can compile their own applications with the Groq compiler for specific scenarios and obtain better performance and latency.
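The post does not describe the Groq compiler’s exact entry point, so the sketch below shows only the framework-side step one would typically perform first: exporting a traced PyTorch model, on the assumption that a vendor toolchain consumes an ONNX artifact of this kind.

```python
# Framework-side preparation only: export a PyTorch model to ONNX.
# That a downstream Groq toolchain accepts this artifact is an assumption,
# not something stated in this post.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).eval()
example_input = torch.randn(1, 4096)

torch.onnx.export(
    model,
    example_input,
    "toy_mlp.onnx",        # artifact a vendor compiler would consume
    input_names=["x"],
    output_names=["y"],
    opset_version=17,
)
```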

02 Groq’s unique hardware architecture

A 10x performance lead starts with the team. Groq’s core members come from the group that built Google’s TPU, and they bring a distinctive, industry-recognized understanding of AI chip design. That understanding is reflected in the LPU’s design philosophy of “software-defined hardware”: a single core whose compute and storage units are configured, and whose every operation is scheduled, entirely by software.

This architecture is called the TSP (Tensor Streaming Processor). Its hardware design is relatively simple: it removes all unnecessary control logic and hands control over to the software compiler, which optimizes how chip area is allocated and yields higher compute per unit area.

The hardware is deliberately simple: it strips out circuits that contribute nothing to AI computation, abandons DRAM, HBM, and CoWoS packaging, and relies directly on high-performance SRAM. This keeps memory bandwidth relatively high while making cascaded scale-out easier and supporting denser cluster configurations, a good fit for the high-performance, low-latency, compute-intensive demands of large-model inference.

The LPU’s TSP architecture uses this software-defined-hardware approach to optimize the execution of machine-learning and deep-learning tasks.

Through static and dynamic partitioning and an optimized layout of SIMD function units, it provides a highly customized computing environment in which the compiler fully controls the hardware at compile time, managing and optimizing execution at a fine grain. This not only makes computation deterministic but also improves efficiency and performance by removing runtime uncertainty.

In contrast, Nvidia’s A100 uses the Ampere architecture, which targets a broad range of workloads, machine learning among them. The A100’s Tensor Cores and support for multiple data types do provide powerful acceleration for deep learning, but the TSP’s specialization lets it deliver better performance and energy efficiency on machine-learning tasks.

The TSP’s design philosophy is to tailor the hardware to specific workloads and thereby beat general-purpose GPU architectures on those tasks.

In chip layout, Groq uses a spatial arrangement in which function units sit adjacent to one another and cooperate by passing operands and results directly between themselves. This forms an efficient execution mode called “chaining”, in which the output of one function unit feeds straight into the input of the adjacent downstream unit. At the core is a large MXM module containing 409,600 multipliers, which exploits on-chip data parallelism to deliver a compute density of more than 1 TeraOps per square millimeter.
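A toy sketch of the chaining idea, not a model of the real pipeline: each stage hands its result directly to the next unit instead of bouncing it through shared memory. The stage functions are hypothetical stand-ins for the matrix, vector, and reshape units described below.

```python
# Conceptual illustration of "chaining": data streams through adjacent stages
# back-to-back, with no round trip to a central memory between stages.
from functools import reduce

def chain(*units):
    """Compose function units so each one's output feeds the next one's input."""
    return lambda x: reduce(lambda acc, unit: unit(acc), units, x)

# Hypothetical stand-ins for MXM (matrix), vector (pointwise), and SXM (reshape) units.
matmul_stage    = lambda v: [sum(v)] * len(v)        # placeholder "matrix" stage
pointwise_stage = lambda v: [max(0, x) for x in v]   # pointwise ReLU
reshape_stage   = lambda v: list(reversed(v))        # trivial data reshuffle

pipeline = chain(matmul_stage, pointwise_stage, reshape_stage)
print(pipeline([1, -2, 3, 4]))
```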

An abstract view of the GroqChip shows several kinds of function units: 320-lane SIMD units, MXM matrix units, vector units for pointwise operations, SXM units for data reshaping, and memory units that manage on-chip storage. Each is reduced to a simple, easily controlled building block.

Data streams east-west across the chip, an arrangement that exploits spatial locality and captures the locality of data flows, which matters a great deal for machine-learning models.

For memory, the LPU forgoes high-bandwidth memory (HBM) in favor of static random-access memory (SRAM), which is roughly 20 times faster than the HBM used by GPUs.

This choice reflects Groq’s optimization strategy for AI inference, which moves less data than model training does. By reducing its dependence on external memory, the LPU also achieves higher energy efficiency.

The LPU also adopts a temporal instruction set computer (TISC) architecture, which works on a fundamentally different principle from GPUs that rely on HBM for high-speed data transfer.

One key advantage of the TISC architecture is that it loads data from memory less frequently, which both eases the pressure of the HBM supply shortage and effectively lowers system cost.

For AI workloads, the LPU therefore offers an alternative path that does not depend on the specialized memory solutions Nvidia GPUs require, and it shows cost-effectiveness and performance advantages in the right application scenarios.

Compared with the Nvidia A100 and RTX 4090 cards commonly used for inference today, the LPU offers bandwidth of up to 80 TB/s at relatively low power consumption.

That high bandwidth keeps the compute units better utilized: compared with the A100, the Groq LPU runs at nearly full utilization on GEMM (general matrix multiplication):
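A back-of-the-envelope roofline calculation makes the utilization argument concrete. The 80 TB/s figure comes from the text; the peak-compute number and matrix shapes below are assumptions chosen only for illustration, not official specs.

```python
# Back-of-the-envelope roofline check. 80 TB/s is quoted in the text above;
# the peak TFLOPS and matrix sizes are illustrative assumptions.

def bandwidth_needed_tb_s(m: int, k: int, n: int, peak_tflops: float,
                          bytes_per_elem: int = 2) -> float:
    """TB/s of memory traffic needed to keep `peak_tflops` busy on an
    (m x k) @ (k x n) GEMM, assuming each operand is read once and the
    result written once (no reuse)."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    time_at_peak = flops / (peak_tflops * 1e12)
    return bytes_moved / time_at_peak / 1e12

# Decode-time GEMMs in LLM inference have a tiny batch dimension, so they are
# bandwidth-bound; batching many concurrent users restores arithmetic intensity.
for batch in (1, 128):
    need = bandwidth_needed_tb_s(m=batch, k=8192, n=8192, peak_tflops=188)  # assumed peak
    print(f"batch {batch:3d}: needs ~{need:,.0f} TB/s to stay compute-bound")
# At batch 1 even 80 TB/s falls short of compute-bound, but it is an order of
# magnitude closer than the ~2 TB/s available to an HBM-based card.
```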

03 Looking at the development trend of AI Infra from Groq

Before the Groq LPU appeared, both large-model training and inference ran on Nvidia GPUs and were built on the CUDA software stack, and AI Infra technology amounted to software-layer optimization within the Nvidia GPU and CUDA ecosystem.

The launch of the Groq LPU makes some clear shifts visible, so it is worth venturing a few predictions about where AI Infra technology is heading.

First, the main battlefield for AI chips will shift from training to inference, and the AI inference market will grow significantly over the next year. Compared with AI training, AI inference is more closely related